Mapping Porpoise Illumina Genomic Data

Author

Yacine Ben Chehida

1 Introduction

This notebook presents a complete workflow for assessing the quality of Illumina whole-genome sequencing data generated for harbour porpoise and Dall’s porpoise populations. The objective is to evaluate the mapping quality of each sample, extract key quality control metrics, and visualise these statistics across individuals and populations.

The workflow integrates three main components:

  • Running Qualimap bamqc, used to compute mapping quality metrics, coverage statistics, and alignment summaries for each BAM file.

  • Automated extraction of QC metrics by parsing Qualimap reports and compiling the resulting values into structured text files.

  • Visualisation in R of mapping quality metrics and coverage distribution across individuals and populations.

Together, these steps provide a summary of mapping performance across the dataset used in X et al. 2026. The resulting plots allow rapid identification of low-quality samples, comparison of sequencing depth across groups, and verification of mapping consistency prior to downstream genomic analyses.

2 Visualize raw sampling

Code
setwd("/Users/yacinebenchehida/Desktop/porpoise_stuff/2025-2026")
dat = read.table("untitled folder/species.txt", comment.char = "")
colnames(dat) = c("ID","Species","Couleur")
dat

3 Number of samples per population

Code
df_counts = as.data.frame(table(dat[2]))
df_counts

4 Running Qualimap on mapped BAM files

This section shows the Slurm script used to run Qualimap bamqc on each BAM file. The command computes mapping quality metrics, alignment statistics, and coverage summaries that we will use in downstream visualization.

Code

#!/bin/bash

#SBATCH --time=25:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=15G 
#SBATCH --account=BIOL-SPECGEN-2018 
#SBATCH --job-name=map_sort_RG_dup_merge

##################################
# 1 - Load necessar softwares #
##################################
module load bio/Qualimap/2.2.1-foss-2019b-R-3.6.2

###########################################                              
# 2 - check mapping quality with qualimap #
###########################################
mkdir -p $path_results/2_Qualimap/"$1"

qualimap bamqc --java-mem-size=20G \
-bam $path_results/1_sorted_dedup_bam/$1/"$1"_sorted_dedup.bam \
-c \
-gd HUMAN \
-nt 8 \
-outdir $path_results/2_Qualimap/"$1" \
-outformat HTML  

5 Extracting useful Qualimap QC Metrics

Code

# Set working directory for collecting Qualimap summary files
WD="/Users/yacinebenchehida/Desktop/porpoise_stuff/2025-2026"
cd "/Users/yacinebenchehida/Desktop/porpoise_stuff/2025-2026/qualimap"

# Remove previous output files to ensure a clean re-run
rm ../MapQ.txt
rm ../Read_mapped.txt
rm ../Pct_read_mapped.txt
rm ../Mean_Cov.txt
rm ../std_Cov.txt
rm ../Pct_1X.txt
rm ../Pct_3X.txt
rm ../Pct_5X.txt
rm ../Pct_10X.txt
rm ../Pct_15X.txt
rm ../Pct_20X.txt
rm ../Pct_Proper_Pair.txt
rm ../Proper_Pair.txt

# Loop over all sample directories produced by Qualimap
for i in *
  do cd $i

      # Extract mean mapping quality
      echo -e $i"\t"$(cat genome_results.txt | grep "mean mapping quality" | awk '{print $5}')"\tMapQ" >> $WD/MapQ.txt

      # Extract number of mapped reads
      echo -e $i"\t"$(cat genome_results.txt | grep "number of mapped reads" | perl -pe 's/,//g' | awk '{print $6}')"\tRead_Mapped" >> $WD/Read_mapped.txt

      # Extract percent of mapped reads
      echo -e $i"\t"$(cat genome_results.txt | grep "number of mapped reads" | perl -pe 's/,//g' | awk '{print $7}' | perl -pe 's/\((\w+)(\.)(\w+)%\)/$1$2$3/g' | perl -pe 's/\(100%\)/100/g')"\tPct_Read_Mapped" >> $WD/Pct_read_mapped.txt

      # Extract properly paired read count
      echo -e $i"\t"$(cat genome_results.txt | grep "number of mapped paired reads (both in pair)" | perl -pe 's/,//g' | awk '{print $10}')"\tProper_Pair" >> $WD/Proper_Pair.txt

      # Compute percent of reads in proper pairs
      echo -e $i"\t"$(cat genome_results.txt | perl -pe 's/,//g' | awk '/number of mapped paired reads \(both in pair\)/{both=$10} /number of reads =/{total=$5} END{printf "%.2f", (both/total)*100}')"\tPct_Proper_Pair" >> $WD/Pct_Proper_Pair.txt

      # Extract mean coverage
      echo -e $i"\t"$(cat genome_results.txt | grep "mean coverageData" | perl -pe 's/X//g' | awk '{print $4}')"\tMean_Cov" >> $WD/Mean_Cov.txt

      # Extract std coverage
      echo -e $i"\t"$(cat genome_results.txt | grep "std coverageData" | perl -pe 's/X//g' | awk '{print $4}')"\tstd_Cov" >> $WD/std_Cov.txt

      # Extract coverage thresholds
      echo -e $i"\t"$(cat genome_results.txt | grep "coverageData >= 1X"  | perl -pe 's/%//g' | awk '{print $4}')"\tPct1X" >> $WD/Pct_1X.txt
      echo -e $i"\t"$(cat genome_results.txt | grep "coverageData >= 3X"  | perl -pe 's/%//g' | awk '{print $4}')"\tPct3X" >> $WD/Pct_3X.txt
      echo -e $i"\t"$(cat genome_results.txt | grep "coverageData >= 5X"  | perl -pe 's/%//g' | awk '{print $4}')"\tPct5X" >> $WD/Pct_5X.txt
      echo -e $i"\t"$(cat genome_results.txt | grep "coverageData >= 10X" | perl -pe 's/%//g' | awk '{print $4}')"\tPct10X" >> $WD/Pct_10X.txt
      echo -e $i"\t"$(cat genome_results.txt | grep "coverageData >= 15X" | perl -pe 's/%//g' | awk '{print $4}')"\tPct15X" >> $WD/Pct_15X.txt
      echo -e $i"\t"$(cat genome_results.txt | grep "coverageData >= 20X" | perl -pe 's/%//g' | awk '{print $4}')"\tPct20X" >> $WD/Pct_20X.txt

    cd ..
done

6 Visualising Mapping Statistics for Each Sample

Code
##############################
# Create scientific notation #
##############################
scientific_10 <- function(x) {
  ifelse(x > 1000,
         parse(text = gsub("e\\+*", " %*% 10^", scales::scientific_format()(x))),
         x)
}

#########################
# Set working directory #
#########################
setwd("/Users/yacinebenchehida/Desktop/porpoise_stuff/2025-2026")

############################################
# Import and prepare the data for plotting #
############################################
data = system("cat *txt",intern = TRUE)             # read all extracted metrics
data = strsplit(data,"\t", fixed = TRUE)
data = do.call(rbind,data)
data = as.data.frame(data)
colnames(data) = c("Sample","Value","Statistics")
data$Value = as.numeric(data$Value)

species = read.table("untitled folder/species.txt") # reference species table
data$species = "ToBeAdded"

###################################################
# Assign the good reference genome to each sample #
###################################################
for (i in 1:dim(species)[1]){
  for (j in 1:dim(data)[1]){
    if (species[i,1] == data[j,1]) {
      data[j,4] = species[i,2]
    }
  }
}

###############################
# Set subspecies order (fixed) #
###############################
ssp_order <- c(
  "HP-BLS", "HP-UP-MA", "HP-UP-IB", "HP-NAT-KAT", "HP-NAT-NOS",
  "HP-NAT-BES", "HP-NAT-PBS", "HP-NAT-BAS", "HP-NAT-ICE", "HP-NAT-CAN",
  "HP-WPAC", "HP-PAC-BERING", "HP-EPAC-RU", "DP"
)

data$species <- as.character(data$species)
data$species <- factor(data$species, levels = ssp_order)

########################################
# Load species colours from metadata   #
########################################
color_ref = as.data.frame(unique(dat[,c(2,3)]))
colnames(color_ref) = c("species","color")
color_ref$species = as.character(color_ref$species)
data$col = "ToBeAdded"

###########################################
# Assign the color to use for each sample #
###########################################
for (i in 1:dim(color_ref)[1]){
  for (j in 1:dim(data)[1]){
    if (color_ref[i,1] == data[j,4]) {
      data[j,5] = color_ref[i,2]
    }
  }
}

color_code = color_ref

###############################################
# Final polishing before generating the plots #
###############################################
data = with(data, data[order(species),])   # reorder lines by species

data$Sample <- factor(                    # reorder samples
  data$Sample,
  levels = rev(unique(with(data, data[order(species),1])))
)

data$Statistics <- factor(                # enforce stats order
  data$Statistics,
  levels = c("MapQ","Read_Mapped","Proper_Pair","Pct_Read_Mapped",
             "Pct_Proper_Pair","Mean_Cov","std_Cov",
             "Pct1X","Pct3X","Pct5X","Pct10X","Pct15X","Pct20X")
)

##############################
# species palette            #
##############################
pal_species <- color_ref$color[match(levels(data$species), color_ref$species)]

##############################
# y-axis sample colours      #
##############################
map_samp <- unique(data[, c("Sample","col")])
map_samp$Sample <- factor(map_samp$Sample, levels = levels(data$Sample))
map_samp <- map_samp[order(map_samp$Sample), ]
col_y <- as.character(map_samp$col)

data_std = subset(data, data$Statistics == "std_Cov")

#################
# Lollipop plot #
#################
ggplot(data = data, aes(x = Value, y = Sample, color = species)) + 
  theme_Publication() + 
  geom_point(size = 2) +
  scale_color_manual(values = pal_species) +
  geom_segment(aes(x = 0, xend = Value, y = Sample, yend = Sample, color = species),
               size = 0.5, linetype = "dashed") +
  facet_grid(. ~ Statistics, scales = "free_x") +
  theme(axis.text.y = element_text(colour = col_y, face = "bold")) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme(panel.grid.major.y = element_blank()) +
  theme_bw() +
  theme(strip.background = element_rect(fill = "bisque1")) +
  theme(strip.text.x = element_text(size = 13)) + 
  scale_x_continuous(labels = scientific_10)

A few conclusions:

  • The mapping quality and the percentage of mapped reads are generally consistent across samples, with most individuals showing MQ > 55 and > 95 percent of reads mapped.

  • As expected, sequencing depth varies substantially among samples. Coverage ranges from 3.6X to 54X, with a mean of 11.9X and a standard deviation of 7.9X (inflated by the higher depth samples from Clemin et al. 2023).

  • Two samples from Clemin et al. 2023 require further attention:

    • A2016-05680: This individual belongs to HP-NAT-BES. Its mapping statistics raise concern: mapping quality is low (MQ = 33), mean coverage is only 3.9X, coverage variance is extremely high, and just 48 percent of the genome reaches at least 1X. These patterns suggest a technical issue during library preparation, sequencing, mapping, or possibly contamination. This sample should likely be excluded from downstream analyses.

    • 5013: This HP-NAT-BES sample. It has only 77 percent of the genome covered at 1X. Such incomplete coverage may introduce noise in downstream analyses. Given the large number of HP-NAT-BES individuals already available, this sample contributes little additional information, and excluding it would avoid potential artefacts without reducing population representation.

7 Mapping statistics across subspecies

7.1 Boxplots

Code
# define species order
data$species = factor(data$species, levels = ssp_order)

# define color palette 
couleur = pal_species

# boxplot of mapping statistics per subspecies
ggplot(data = data, aes(x = species, y = Value, color = species)) +
theme_Publication() +
geom_boxplot() +
scale_color_manual(values = couleur) +
facet_wrap(Statistics ~ ., scales = "free_y", ncol = 3) +
theme_bw() +
theme(
strip.text.x = element_text(size = 10, face = "bold", colour = "black"),
strip.text.y = element_text(size = 10, face = "bold", colour = "black"),
axis.text.x = element_blank(),
strip.background = element_rect(fill = "bisque1")
) +
theme(strip.text.x = element_text(size = 15)) +
scale_y_continuous(labels = scientific_10) +
xlab("Subspecies")

=> Most populations, except those in the North Pacific, have at least 50 percent of their genome covered at 5X.

7.2 Average per subspecies

Code
# compute summary stats for mean coverage per subspecies
data[data$Statistics == "Mean_Cov", ] %>%
group_by(species) %>%
dplyr::summarize(
Mean = mean(Value, na.rm = TRUE),
Sd   = sd(Value, na.rm = TRUE)
)